As students of economics, we have studied many different industries and market structures in order to understand how firms try to maximize their profits. In many of the models that we studied in prior classes, pricing of goods and services was straightforward, and we had clear evidence showing why a producer would want to sell their product at certain quantities and prices. But when it comes to airlines selling flights, how firms maximize their profit is more ambiguous. We know that airlines compete in an oligopoly market, so they would like to collude but cannot, which raises the question: what is their next best alternative? What methods can they employ to get an edge over their competitors and maximize their profit when their best option is unavailable? Anyone who has tried to buy an airline ticket has probably realized that not every ticket is priced the same and that some people get better deals than others. We know that one method airlines use to maximize profit is price discrimination, but that does not tell us anything about their revenue relative to their costs. In other words, knowing that a businessperson will typically pay more for a ticket does not tell us much about how much money the airline is going to make overall. Are airlines getting the most they can out of flights, and what incentives are there to offer deals and discounts? These are important questions when we want to learn more about the industry.
We also want to know what kinds of deals airlines make to increase profit. For example, we will look at whether airlines offer bulk discounts for groups buying tickets in large quantities, or whether they offer slightly better deals on round-trip flights so they can make money by taking a person to a place and bringing them back. These are standard practices in some other industries, and we wanted to see whether they apply to this industry as well. If they do, that might give us as consumers ideas on how to take advantage of the deals offered by the airlines.
To state the plan explicitly, we will take airline data and attempt to use flight characteristics to build a regression that determines what kind of revenue airlines are able to make off a given flight. The main variable we will be trying to predict is itinerary yield, the fare per mile flown, which serves as our measure of the revenue a flight brings in. We will look at how revenue is affected by distance and how much a flight can make per mile. That will give us an idea of whether or not airlines offer discounted rates to those flying long distances. Our hope is that our regression will be able to answer all these questions and more, as well as provide a stable model for future research, or at the very least a rough basis to be polished and refined at a later date.
Below is a visual that shows the type of study we originally wanted to conduct. The data behind it is not open source, and because we could not find data of this type, we changed the data and the remainder of the project.
The visual gives a rough answer to our original questions: “When is the cheapest time to buy airline tickets? Does pricing change significantly as demand for flights varies? Do airlines vary price to combat strategic consumers?” In short, yes: ticket prices vary considerably with the time of purchase.
Cheapair.com
Data Source: Source
To provide an answer to the question “can we predict revenue earned from flights based on their characteristics?”, a list of variables that have significant effects on the price of airline tickets needs to be evaluated. Research and analysis from others who have examined individual factors and price determinants will be the basis for what data should be obtained and applied to create a new regression study. The primary factors to be examined are the scarcity of seats and the impact of oil prices and consumer strategies on the price of a seat.
Much of the research is based on the interactions between consumers and airline pricing. To begin, it was necessary to find evidence that consumers within the airline ticket market are strategic as assumed (i.e., are the assumed interactions between price and consumers reasonable, or are they a result of other hidden factors?). Li and Netessine (2014) provide an analysis of this base question. Their research asks whether consumers are strategic within the airline industry to a significant degree, and secondarily what effects this may have on the revenues of airlines. Due to their work, we can say with relative confidence that strategic consumers do exist within this market space, with an estimated share of 5.2% to 19.2% of the consumer market (Li et al, 2014). The effects of their presence on total revenue are more complicated: they found that nondecreasing price commitment strategies can reduce the level of strategic consumers, although these same strategies lead to decreased levels of indifferent consumers as well (Li et al, 2014), i.e. those who buy tickets simply because they are cheap, rather than waiting for prices to drop before buying. These findings show that the effect of strategic consumers on revenue may be a relationship worth investigating, since understanding it would allow for more exact price strategies that maximize revenues by controlling and shifting market strategies based on flight patterns, trends, and consumer inputs.
This leaves the question: what timing is most strategic when selling tickets? To answer this, two things must be identified. The first is which market (business or tourist) the airline is primarily selling to. The second is in which of three periods passengers are purchasing. It seems that the ideal model sells a substantial portion of seats in the initial period to business persons, who are less likely to cancel, and then opens ticket purchases to the tourist market. Seating capacity is, of course, limited by the plane. When capacity is low, airlines typically are “better off selling exclusively to business consumers, who have higher valuations and thus will pay more.” (Bischoff et al, 2011)
The final period of sales concerns last-minute deals, which is the idea that people can fill seats that would otherwise go unoccupied for a less expensive price than normal. This practice, while cutting into the profit that could be realized by the airline, usually cuts the costs that would have been seen by an unoccupied seat. This tool is usually only utilized when capacity is high and only on the day of the flight.
So, in answer to the question of what timing is most strategic when selling tickets, we found that airlines should price discriminate in the first period of purchasing, especially in the tourist market. In the second and third purchasing periods they should market moderately priced tickets to the business segment of the market, and finally, on the day of the flight, they need to utilize the most profit-preserving model of last-minute deals for unfilled seats.
This pricing strategy, however, changes with increases and decreases in flight frequencies. (Cattaneo et al, 2018) The authors found that fare variations have a negative correlation with changes in the frequency of a flight. Simply said, frequent flights reduce an airline’s ability to price discriminate.
This leads to pricing strategies and consumer responses to them. Airlines will often charge more for a one-way ticket than for a round-trip ticket, causing consumers to buy a round-trip ticket but skip the return trip to save money. (Bischoff, et al 2011) As one could imagine, carriers are not very fond of this practice and do their best to curb it while maintaining the same price discriminatory practices. These authors also note the history of air travel and how demand has settled into a seasonal pattern where most people travel for holidays. This leads into an analysis of basic consumer elasticities: a consumer purchasing a last-minute round-trip flight is typically very inelastic, whereas a consumer buying tickets months in advance for a vacation flight is more elastic and responsive to price. The authors then explain some price discrimination methods, such as offering discounts to flyers booking well in advance as well as offering a discount for last-minute customers to book their return flight with the same airline. The shift toward a lower cost flight model has affected the way that higher cost airlines do business and has pushed them to reexamine how they control prices. (Bischoff et al, 2011)
Additionally, airlines price discriminate via the day of the week a ticket is purchased. Puller and Taylor (2012) found ticket prices to be lower on weekends than on weekdays. They concluded this is because people buying on weekends tend to be buying for leisure purposes and are thus more price-elastic, in other words, more sensitive to price changes.
One thing not yet answered is whether prices rise within certain hours of the day. It turns out that prices are higher during regular office hours (the time businesspersons are most likely to buy) and lower in the evening (when vacationers are more likely to buy). “As the proportion of business travelers increases closer to departure, both price dispersion and price discrimination become larger.” (Escobari et al, 2019)
While much of the peer-reviewed literature has information on the general timing of selling tickets, we want to find out whether there are other pricing strategies. In particular, are there price changes within the few minutes after one begins looking at a flight? The goal is to find a way to predict the best possible time to buy any given flight.
Note: When we were unable to find publicly available data for the above goal and question, we formulated a new goal based on the data we did have access to. We now focus on how we can better understand the revenue and pricing strategies of airlines as a way to locate any potential loopholes or gaps that could be exploited by consumers.
All data was obtained via the US Federal Aviation Database systems. Source
Due to the size of the data, this document is created with code that randomly samples 20,000 points from our 11 million. As a result, the charts below vary somewhat between runs, so we refrained from interpreting them; the tabular outputs, by contrast, represent the full data set. When discussing regression outputs, we coded in references to the outputs so that the reported values update while the interpretations stay static. The patterns and relationships observed remain constant across samples due to the strength and significance of our variables as well as the large sample size.
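The sampling step described above can be sketched in base R as follows. This is only an illustration: `full_dat`, its simulated columns, and the seed value are assumptions and not the report's actual objects. Fixing a seed is one way to make the charts identical from run to run.

```r
# Hypothetical stand-in for the full itinerary table (the real one has ~11 million rows).
set.seed(42)  # assumed seed; any fixed value makes the draw repeatable
full_dat <- data.frame(
  ITIN_YIELD = runif(100000, min = 0.05, max = 2),
  PASSENGERS = sample(1:20, 100000, replace = TRUE)
)

# Draw 20,000 distinct row indices and keep only those rows.
samp_idx  <- sample(nrow(full_dat), size = 20000)
samp_demo <- full_dat[samp_idx, ]

nrow(samp_demo)
```

Because `sample()` draws without replacement by default, no itinerary appears twice in the sample.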
| Variable Name | Description |
|---|---|
| QUARTER | Quarter (1-4) |
| ROUNDTRIP | Round Trip Indicator (1=Yes) |
| ITIN_YIELD | Itinerary Fare Per Miles Flown in Dollars (ITIN_FARE/MilesFlown). |
| PASSENGERS | Number of Passengers |
| ITIN_FARE | Itinerary Fare Per Person |
| DISTANCE_GROUP | Distance Group, in 500 Mile Intervals |
| MILES_FLOWN | Itinerary Miles Flown (Track Miles) |
| ITIN_GEO_TYPE | Itinerary Geography Type, 0 = Contiguous Domestic (Lower 48 U.S. States Only) , 1 = Non-contiguous Domestic (Includes Hawaii, Alaska and Territories) |
tabl3 <-"
| Transformed Variable Name | Original Variable Name | Description |
|--------------------|:---------------:|:--------------------------:|
| lPASSENGERS | PASSENGERS | Log(PASSENGERS) |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN | $\\sqrt{\\frac{1}{\\text{DISTANCE_GROUP} * \\text{MILES_FLOWN}}}$ |
"
tabl3 %>% pander()
| Transformed Variable Name | Original Variable Name | Description |
|---|---|---|
| lPASSENGERS | PASSENGERS | Log(PASSENGERS) |
| SQRT_1over_DG_x_MF | DISTANCE_GROUP & MILES_FLOWN | \(\sqrt{\frac{1}{\text{DISTANCE_GROUP} * \text{MILES_FLOWN}}}\) |
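As a minimal sketch, the two transformed variables in the table could be built as shown here. The three rows of `demo` are made-up values, and the use of the natural log is an assumption based on R's default `log()`.

```r
# Made-up itineraries used only to illustrate the two transformations.
demo <- data.frame(
  PASSENGERS     = c(1, 2, 10),
  DISTANCE_GROUP = c(1, 3, 6),
  MILES_FLOWN    = c(300, 1200, 2800)
)

# Natural log of the passenger count (assumed; log() in R is base e).
demo$lPASSENGERS <- log(demo$PASSENGERS)

# Square root of the reciprocal of DISTANCE_GROUP * MILES_FLOWN.
demo$SQRT_1over_DG_x_MF <- sqrt(1 / (demo$DISTANCE_GROUP * demo$MILES_FLOWN))

demo
```

The transformed distance variable shrinks as either the distance group or the miles flown grows, so a positive coefficient on it corresponds to yield falling with distance.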
Overall, these summaries are meant to provide a simpler view of our overall model given the high number of dimensions present. A version of all these variables shown on just 2 plots is available in the ‘Regression Plot’ section.
ggplot(samp %>% drop_na()) +
geom_smooth(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP)) +
#geom_jitter(aes(x= lPASSENGERS, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.05) +
facet_grid(rows= ~ITIN_GEO_TYPE)+
theme_bw()+
labs(col = "Flight Type", title= "Yield by log(Passengers)")+
xlab("Log(Passengers)") + ylab("Fare per mile per passenger (Dollars)")
ggplot(samp %>% drop_na()) +
#geom_jitter(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP), alpha= 0.0075) +
geom_smooth(aes(x= DISTANCE_GROUP, y=ITIN_YIELD, col=ROUNDTRIP)) +
facet_grid(rows= ~ITIN_GEO_TYPE)+
theme_bw()+
theme(
panel.spacing = unit(0.5, "lines")
)+
labs(col = "Flight Type", title= "Yield by Distance")+
xlab("Distance in intervals of 500") + ylab("Fare per mile per passenger (Dollars)")
ggplot(data = samp) +
geom_histogram(aes(x=ITIN_YIELD, fill= ROUNDTRIP)) +
theme_bw() +
gghighlight(use_direct_label = FALSE) +
facet_wrap(~ITIN_GEO_TYPE) +
theme(
panel.spacing = unit(0.5, "lines"),
axis.ticks.x=element_blank()
)+
labs(fill = "Flight Type", title= "Distribution of Yields by Flight Types")+
xlab("Fare per mile per passenger (Dollars)") + ylab("Count")
Again, these tables are meant to provide a similar glance at our data as a whole and at the general distributions of the groupings and relationships we were examining.
pander(favstats(ITIN_YIELD ~ ROUNDTRIP + ITIN_GEO_TYPE, data=FullDat_Filt)[c("ROUNDTRIP.ITIN_GEO_TYPE", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Flight Type and Geography Type")
| ROUNDTRIP.ITIN_GEO_TYPE | Q1 | median | mean | Q3 | sd | n |
|---|---|---|---|---|---|---|
| One-Way.Continguous Domestic | 0.1047 | 0.1738 | 0.2354 | 0.2908 | 0.2091 | 4025439 |
| RoundTrip.Continguous Domestic | 0.1015 | 0.1599 | 0.2063 | 0.2571 | 0.1657 | 5803943 |
| One-Way.Non-Continguous Domestic | 0.0709 | 0.1014 | 0.1423 | 0.1586 | 0.148 | 338790 |
| RoundTrip.Non-Continguous Domestic | 0.0681 | 0.0942 | 0.1246 | 0.1337 | 0.1322 | 438735 |
Here, given the total of 25 distance groups, we trimmed the output down to the shortest 5, middle 5, and longest 3.
pander(favstats(ITIN_YIELD ~ DISTANCE_GROUP, data=FullDat_Filt)[c(1:5, 12:16, 23:25),c("DISTANCE_GROUP", "Q1","median", "mean","Q3", "sd","n")], caption= "Summary table of Yields by Distance Group")
| | DISTANCE_GROUP | Q1 | median | mean | Q3 | sd | n |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.3347 | 0.5259 | 0.6113 | 0.8075 | 0.3752 | 420249 |
| 2 | 2 | 0.1809 | 0.2847 | 0.3348 | 0.4358 | 0.2166 | 1675504 |
| 3 | 3 | 0.1301 | 0.2045 | 0.2357 | 0.3043 | 0.1472 | 1941759 |
| 4 | 4 | 0.1071 | 0.1659 | 0.1874 | 0.2405 | 0.1128 | 1759498 |
| 5 | 5 | 0.0901 | 0.136 | 0.1561 | 0.1984 | 0.09458 | 1631197 |
| 12 | 12 | 0.0601 | 0.0841 | 0.09356 | 0.1158 | 0.048 | 74074 |
| 13 | 13 | 0.0595 | 0.0819 | 0.08936 | 0.1104 | 0.04411 | 30455 |
| 14 | 14 | 0.0613 | 0.0829 | 0.09038 | 0.1134 | 0.04188 | 26418 |
| 15 | 15 | 0.0572 | 0.079 | 0.08533 | 0.1071 | 0.0393 | 14473 |
| 16 | 16 | 0.0611 | 0.0798 | 0.08534 | 0.1036 | 0.03466 | 22320 |
| 23 | 23 | 0.0552 | 0.06725 | 0.07131 | 0.0857 | 0.02555 | 258 |
| 24 | 24 | 0.0568 | 0.0681 | 0.07164 | 0.08505 | 0.02991 | 131 |
| 25 | 25 | 0.0631 | 0.0869 | 0.07687 | 0.1041 | 0.03513 | 377 |
Two regressions were created during our attempts to better understand the data and the relationships between our variables. The first uses at most simple transformations such as logs to help reduce heteroskedasticity, while the second employs a more abstract transformation in order to linearize a variable that did not initially hold a simple linear relationship with our endogenous variable.
Variables
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = 1.5 )
text(.7, .8, Signif, cex=cex, col=2)
}
pairs(samp, lower.panel=panel.smooth, upper.panel=panel.cor)
\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]
lm1 <- lm(ITIN_YIELD ~ lPASSENGERS + DISTANCE_GROUP + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:DISTANCE_GROUP , data= samp)
summary(lm1) %>% pander
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.3562 | 0.00262 | 136 | 0 |
| lPASSENGERS | -0.03367 | 0.003506 | -9.604 | 8.538e-22 |
| DISTANCE_GROUP | -0.03532 | 0.0005174 | -68.26 | 0 |
| ROUNDTRIPRoundTrip | 0.05424 | 0.00263 | 20.62 | 1.735e-93 |
| ITIN_GEO_TYPENon-Continguous Domestic | 0.09018 | 0.005082 | 17.75 | 6.316e-70 |
| lPASSENGERS:DISTANCE_GROUP | -0.002049 | 0.0007624 | -2.688 | 0.007203 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 20000 | 0.1632 | 0.2247 | 0.2245 |
lm1_r2 <- round(summary(lm1)$adj.r.squared, 2)
lm1_RSE <- round(sigma(lm1)*100, 1)
matrix_coef <- summary(lm1)$coefficients
my_estimates <- matrix_coef[ , 1]
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3]*100, 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6]*100, 2) # scaled to cents like b1-b4; unscaled, it rounds to 0
Our initial regression model using ordinary least squares results in an \(R^2\) of 0.22, which in the scope of our data is fairly substantial. Airline pricing is incredibly varied and involves hundreds of possible factors; we have access to a very limited number of them and are thus only able to account for total variation to a limited extent. Our residual standard error, however, is less than ideal when taken in context: an error of 16.3 cents in yield is a large share of our total yield range ($0.05-$2), roughly 8% of that range to be specific.
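The share of the yield range taken up by the residual standard error can be checked with a line of arithmetic. The figures are the ones reported above; the variable names are only illustrative.

```r
rse         <- 0.1632     # residual standard error of lm1, in dollars
yield_range <- 2 - 0.05   # width of the observed yield range, in dollars

share <- rse / yield_range
round(100 * share, 1)     # share of the range, in percent
```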
Skipping the y-intercept, as its interpretation would make little realistic sense in this case, the specific coefficient interpretations are as follows:
A one-unit increase in log passengers is associated with a 3.37 cent decline in yield before the interaction with distance is taken into account; equivalently, a 1% increase in itinerary passengers corresponds to a decline of roughly 0.03 cents.
For every additional 500-mile distance group on an itinerary we see a 3.53 cent decline in yield.
Roundtrip flights on average provide an additional 5.42 cents of yield.
Non-contiguous domestic flights on average yield 9.02 cents more per mile than contiguous flights.
For each additional distance group, the effect of log passengers on yield becomes more negative by about 0.2 cents.
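Because passengers enter the model in logs and interact with distance group, the effect of a 1% change in passengers depends on the distance group. The sketch below works this out from the coefficients reported in the table above; the helper name `effect_1pct_cents` is hypothetical.

```r
b_lpass <- -0.03367    # lPASSENGERS coefficient (dollars per unit of log passengers)
b_inter <- -0.002049   # lPASSENGERS:DISTANCE_GROUP coefficient

# Change in yield, in cents, for a 1% increase in passengers at a given distance group.
# A 1% increase raises log(passengers) by log(1.01), which is about 0.00995.
effect_1pct_cents <- function(distance_group) {
  100 * (b_lpass + b_inter * distance_group) * log(1.01)
}

effect_1pct_cents(c(1, 5, 12))
```

The effect is a small fraction of a cent at every distance group and grows more negative as the itinerary gets longer.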
As the data is not a time series, we limited our testing to heteroskedasticity and multicollinearity.
Below are the results from a Breusch-Pagan test:
bptest(lm1)
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 618.54, df = 5, p-value < 2.2e-16
Despite the transformation applied to passengers, significant heteroskedasticity is still present. This is likely due to variability that increases with X, as well as misspecification errors from omitting significant variables.
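For readers who want to see what the studentized Breusch-Pagan statistic measures, here is a minimal base-R sketch on simulated data (not the airline sample). It assumes the Koenker form of the test: regress the squared residuals on the regressors and compare n times that auxiliary \(R^2\) with a chi-squared distribution.

```r
set.seed(1)
n <- 2000
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.2 * x)   # simulated noise that grows with x

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the same regressor.
aux <- lm(residuals(fit)^2 ~ x)

bp_stat   <- n * summary(aux)$r.squared      # test statistic
bp_df     <- length(coef(aux)) - 1           # one df per non-intercept regressor
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_pvalue)
```

A large statistic with a tiny p-value, as in the report's output, means the residual variance changes systematically with the regressors.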
Due to concerns about high correlation between our variables we tested for Multi-collinearity as well:
vif(lm1)
## lPASSENGERS DISTANCE_GROUP
## 3.814814 1.689688
## ROUNDTRIP ITIN_GEO_TYPE
## 1.258845 1.307425
## lPASSENGERS:DISTANCE_GROUP
## 3.773666
As none of our values are greater than 10 we should not be worried about multi-collinearity.
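The variance inflation factor for a numeric predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the others. The sketch below computes it by hand on simulated predictors; `x1`, `x2` and `x3` are made up for illustration and are not the report's variables.

```r
set.seed(2)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# Regress each predictor on the remaining ones and convert R^2 into a VIF.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The correlated pair shows inflated values while the unrelated predictor stays near 1, which is the pattern the threshold of 10 is meant to screen for.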
Because heteroskedasticity violates the assumptions under which OLS is BLUE and makes the usual standard errors unreliable, we recomputed our inference using heteroskedasticity-robust (HC1) standard errors. As shown below, the skeleton of the model remains the same; the coefficient estimates are identical to those from OLS, and only the standard errors, t values, and p-values are recalculated.
\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{Distance Group} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:Distance Group} + \epsilon_i \]
As shown in the output below, the estimated relationships of our exogenous variables to our endogenous variable yield are unchanged. The robust standard errors are somewhat larger, so the t values shrink, most visibly for the interaction term, whose p-value rises from 0.0072 to 0.0383.
coeftest(lm1, vcov = vcovHC(lm1, type= 'HC1'))
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) 0.35617411 0.00371786 95.8008
## lPASSENGERS -0.03366833 0.00469865 -7.1655
## DISTANCE_GROUP -0.03531896 0.00067764 -52.1203
## ROUNDTRIPRoundTrip 0.05423850 0.00283882 19.1060
## ITIN_GEO_TYPENon-Continguous Domestic 0.09018426 0.00541778 16.6460
## lPASSENGERS:DISTANCE_GROUP -0.00204897 0.00098900 -2.0717
## Pr(>|t|)
## (Intercept) < 2.2e-16 ***
## lPASSENGERS 8.017e-13 ***
## DISTANCE_GROUP < 2.2e-16 ***
## ROUNDTRIPRoundTrip < 2.2e-16 ***
## ITIN_GEO_TYPENon-Continguous Domestic < 2.2e-16 ***
## lPASSENGERS:DISTANCE_GROUP 0.0383 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
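To make clear what the HC1 correction changes, here is a base-R sketch of the sandwich formula on simulated data. It is a hand calculation under stated assumptions, not the internals of `vcovHC`. The coefficient estimates come straight from `lm`; only the variance matrix is replaced.

```r
set.seed(3)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.3 * x + rnorm(n, sd = 0.3 * x)   # simulated heteroskedastic noise
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # sum over i of e_i^2 * x_i x_i'
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread   # HC1 = HC0 scaled by n/(n-k)

cbind(
  estimate = coef(fit),
  ols_se   = sqrt(diag(vcov(fit))),
  hc1_se   = sqrt(diag(vcov_hc1))
)
```

The `estimate` column is the same whichever standard error is used, which is why the robust output repeats the OLS coefficients.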
In addition to the robust standard errors, given the extremity of our Breusch-Pagan results we felt it would also be useful to calculate 95% confidence intervals for our estimators and be doubly sure that they remained interpretable and useful. As shown, none of the intervals contains zero, so all estimates retain their signs and are safe to include and utilize in a model.
coefci(lm1, vcov = vcovHC(lm1, type= 'HC1'))
## 2.5 % 97.5 %
## (Intercept) 0.348886791 0.3634614239
## lPASSENGERS -0.042878064 -0.0244585977
## DISTANCE_GROUP -0.036647201 -0.0339907265
## ROUNDTRIPRoundTrip 0.048674187 0.0598028184
## ITIN_GEO_TYPENon-Continguous Domestic 0.079564955 0.1008035654
## lPASSENGERS:DISTANCE_GROUP -0.003987494 -0.0001104392
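Each interval is the estimate plus or minus a t critical value times the robust standard error. As a check, the lPASSENGERS interval can be rebuilt from the numbers reported in the robust coefficient table, assuming 20000 - 6 residual degrees of freedom.

```r
est      <- -0.03366833   # lPASSENGERS estimate
se_hc1   <- 0.00469865    # its HC1 standard error
df_resid <- 20000 - 6     # observations minus estimated coefficients

t_crit <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * t_crit * se_hc1
ci
```

The result agrees with the reported bounds of -0.042878064 and -0.0244585977 up to rounding of the inputs.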
In this transformed model, the nonlinear relationship between distance group, miles flown, and yield was transformed into a simple linear relationship (refer to the variable overview). The implications of this are expanded upon below.
\[ \underbrace{Y_i}_\text{Itinerary Yield} \underbrace{=}_{\sim} \overbrace{\beta_0}^{\stackrel{\text{y-int}}{\text{Base Yield}}} + \overbrace{\beta_1}^{\stackrel{\text{slope along}}{\text{lPassenger}}} \underbrace{X_{1i}}_\text{lPassenger} + \overbrace{\beta_2}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{2i}}_\text{SQRT_1over_DG_x_MF} + \overbrace{\beta_3}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{3i}}_\text{Roundtrip} + \overbrace{\beta_4}^{\stackrel{\text{change in}}{\text{y-int}}} \underbrace{X_{4i}}_\text{Non-Continguous} +\overbrace{\beta_5}^{\stackrel{\text{change in}}{\text{slope}}} \underbrace{X_{1i}X_{2i}}_\text{lPassenger:SQRT_1over_DG_x_MF} + \epsilon_i \]
To best maintain the ability to compare the two regressions, all variables were kept the same except for the replacement of DISTANCE_GROUP with the new transformed variable.
lm2 <- lm(ITIN_YIELD ~ lPASSENGERS + SQRT_1over_DG_x_MF + ROUNDTRIP + ITIN_GEO_TYPE + lPASSENGERS:SQRT_1over_DG_x_MF, data= samp)
summary(lm2) %>% pander
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.007968 | 0.002632 | -3.028 | 0.002466 |
| lPASSENGERS | -0.01453 | 0.002569 | -5.654 | 1.587e-08 |
| SQRT_1over_DG_x_MF | 13.17 | 0.1119 | 117.7 | 0 |
| ROUNDTRIPRoundTrip | 0.07107 | 0.002136 | 33.28 | 2.307e-236 |
| ITIN_GEO_TYPENon-Continguous Domestic | 0.01247 | 0.003821 | 3.263 | 0.001105 |
| lPASSENGERS:SQRT_1over_DG_x_MF | -1.928 | 0.1176 | -16.4 | 4.951e-60 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 20000 | 0.1372 | 0.4519 | 0.4517 |
lm2_r2 <- round(summary(lm2)$adj.r.squared, 2)
lm2_RSE <- round(sigma(lm2)*100, 1) # use lm2 here; sigma(lm1) would repeat the first model's error
matrix_coef <- summary(lm2)$coefficients
my_estimates <- matrix_coef[ , 1]
b0 <- round(my_estimates[1]*100, 2)
b1 <- round(my_estimates[2]*100, 2)
b2 <- round(my_estimates[3], 2)
b3 <- round(my_estimates[4]*100, 2)
b4 <- round(my_estimates[5]*100, 2)
b5 <- round(my_estimates[6], 2)
Our transformed regression model using ordinary least squares results in an \(R^2\) of 0.45, which in the scope of our data is substantial. Airline pricing is incredibly varied and involves hundreds of possible factors; we have access to a very limited number of them and are thus only able to account for part of the total variation. Our residual standard error is still less than ideal when taken in context: an error of 13.7 cents in yield is a large share of our total yield range ($0.05-$2), roughly 7% of that range to be specific. The primary drawback of this model is that we lose the ability to easily interpret a change in distance due to the complexity of the transformation.
Skipping the y-intercept, as its interpretation would make little realistic sense in this case, the specific coefficient interpretations are as follows:
A one-unit increase in log passengers is associated with a 1.45 cent decline in yield before the interaction with distance is taken into account; equivalently, a 1% increase in itinerary passengers corresponds to a decline of roughly 0.01 cents.
For every 1 unit increase in \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) on an itinerary we see a 13.17 dollar increase in yield.
Roundtrip flights on average provide an additional 7.11 cents of yield.
Non-contiguous domestic flights on average yield 1.25 cents more per mile than contiguous flights; the effect is much smaller than in the first model but remains significant (p = 0.0011).
For each one-unit increase in log passengers, the slope on \(\text{(Miles Flown * Distance group)}^{-\frac{1}{2}}\) falls by 1.93 dollars.
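Since a one-unit change in the transformed distance variable is hard to picture, it may help to evaluate the distance term at a few itineraries. The sketch below uses the reported coefficient of 13.17 and made-up mileages. It assumes distance groups are 500-mile bins starting at group 1, and it shows only the distance term for a single-passenger itinerary (lPASSENGERS = 0), not a full prediction.

```r
b_dist <- 13.17   # SQRT_1over_DG_x_MF coefficient from the lm2 table

# Hypothetical itineraries; the binning rule is an assumption.
itin <- data.frame(MILES_FLOWN = c(250, 750, 2250))
itin$DISTANCE_GROUP <- floor(itin$MILES_FLOWN / 500) + 1

itin$SQRT_1over_DG_x_MF    <- sqrt(1 / (itin$DISTANCE_GROUP * itin$MILES_FLOWN))
itin$distance_term_dollars <- b_dist * itin$SQRT_1over_DG_x_MF

itin
```

The contribution of distance to yield falls steeply over the first few groups and then flattens, which is the same shape seen in the summary table of yields by distance group.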
As the data is not a time series, we limited our testing to heteroskedasticity and multicollinearity.
bptest(lm2)
##
## studentized Breusch-Pagan test
##
## data: lm2
## BP = 3188.8, df = 5, p-value < 2.2e-16
Despite the transformation applied to passengers and the attempt to linearize distance, significant heteroskedasticity is still present, in this case even more so than before. This is likely due to variability that increases with X, as well as misspecification errors from omitting significant variables.
Due to concerns about high correlation between our variables we tested for Multi-collinearity as well:
vif(lm2)
## lPASSENGERS SQRT_1over_DG_x_MF
## 2.898553 1.542731
## ROUNDTRIP ITIN_GEO_TYPE
## 1.173857 1.045241
## lPASSENGERS:SQRT_1over_DG_x_MF
## 3.376770
As none of our values are greater than 10 we should not be worried about multi-collinearity.
Again, due to the violations found in our assumptions, we calculated robust standard errors to use in place of the traditional OLS standard errors.
As shown in the output below, the estimated relationships of our exogenous variables to our endogenous variable yield are unchanged; only the standard errors, t values, and p-values differ. The most notable change is that the intercept is no longer significant at the 5% level.
coeftest(lm2, vcov = vcovHC(lm2, type= 'HC1'))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0079681 0.0041967 -1.8987 0.05762
## lPASSENGERS -0.0145276 0.0035826 -4.0551 5.031e-05
## SQRT_1over_DG_x_MF 13.1726800 0.2588823 50.8829 < 2.2e-16
## ROUNDTRIPRoundTrip 0.0710741 0.0024009 29.6031 < 2.2e-16
## ITIN_GEO_TYPENon-Continguous Domestic 0.0124661 0.0031403 3.9697 7.220e-05
## lPASSENGERS:SQRT_1over_DG_x_MF -1.9277300 0.2431450 -7.9283 2.337e-15
##
## (Intercept) .
## lPASSENGERS ***
## SQRT_1over_DG_x_MF ***
## ROUNDTRIPRoundTrip ***
## ITIN_GEO_TYPENon-Continguous Domestic ***
## lPASSENGERS:SQRT_1over_DG_x_MF ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, in addition to the robust standard errors, given the extremity of our Breusch-Pagan results we felt it would also be useful to calculate 95% confidence intervals for our estimators and be doubly sure that they remained interpretable and useful. As shown, all estimates retain their signs and are safe to include and utilize in a model, with the exception of our intercept, whose interval contains zero.
coefci(lm2, vcov = vcovHC(lm2, type= 'HC1'))
## 2.5 % 97.5 %
## (Intercept) -0.016194015 0.0002577284
## lPASSENGERS -0.021549815 -0.0075054692
## SQRT_1over_DG_x_MF 12.665249314 13.6801106800
## ROUNDTRIPRoundTrip 0.066368120 0.0757800324
## ITIN_GEO_TYPENon-Continguous Domestic 0.006310914 0.0186213397
## lPASSENGERS:SQRT_1over_DG_x_MF -2.404314287 -1.4511457550
With the HTML output you can manipulate these plots on your end (it takes a bit of practice), but the one pattern that they really help show is the difference in the grouping of roundtrip flights (encoded as color) between contiguous and non-contiguous domestic flights.
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.025
#Setup Axis
axis_x <- seq(min(samp$DISTANCE_GROUP), max(samp$DISTANCE_GROUP), by = graph_reso)
axis_y <- seq(min(samp$lPASSENGERS), max(samp$lPASSENGERS), by = graph_reso)
axis_col <- as.factor(c("One-Way", "RoundTrip"))
axis_f <- as.factor(c("Continguous Domestic", "Non-Continguous Domestic"))
#Sample points
lmnew <- expand.grid(DISTANCE_GROUP = axis_x, lPASSENGERS = axis_y, ROUNDTRIP = axis_col, ITIN_GEO_TYPE = axis_f , KEEP.OUT.ATTRS=F)
lmnew$Z <- predict.lm(lm1, newdata = lmnew)
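lmnew <- acast(lmnew, lPASSENGERS ~ DISTANCE_GROUP , value.var = "Z", fun.aggregate = mean) #y ~ x; averages over the four ROUNDTRIP x ITIN_GEO_TYPE combinations per cell (without fun.aggregate, acast falls back to counting)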
samp %>%
filter(ITIN_GEO_TYPE == "Continguous Domestic") %>%
plot_ly(.,
x = ~DISTANCE_GROUP,
y = ~lPASSENGERS,
z = ~ITIN_YIELD,
#text = rownames(samp %>% drop_na()),
type = "scatter3d",
mode ="markers",
color = ~as.factor(ROUNDTRIP),
alpha= 0.7) %>%
layout(title= list(text = "Continguous Domestic Flights (Lower 48)"))
samp %>%
filter(ITIN_GEO_TYPE == "Non-Continguous Domestic") %>%
plot_ly(.,
x = ~DISTANCE_GROUP,
y = ~lPASSENGERS,
z = ~ITIN_YIELD,
#text = rownames(samp %>% drop_na()),
type = "scatter3d",
mode ="markers",
color = ~as.factor(ROUNDTRIP),
alpha= 0.7) %>%
layout(title= list(text = "Non-Continguous Domestic Flights (Outside Lower 48)"))
To conclude, our findings are similar to those of other studies: consumers who purchase extremely far in advance pay higher prices. Business customers pay higher prices because they are largely inelastic to price changes, and airlines whose flights are not full on the day of departure will sell the remaining seats at a dramatic markdown to fill the plane. From the airlines’ perspective, as distance grows larger, yield per passenger decreases, which is consistent with non-contiguous flights being more costly to the airlines than contiguous flights.
While our results are far from all-encompassing and exhaustive, our findings strengthen the overall consensus of the studies that have been conducted prior. One potential avenue for future study would be to look at consumer behavior when consumers are made aware of airlines’ manipulation of the market, for instance by comparing purchasing habits before and after consumers learn of the tools available to them. Another would be the potential impact on the market and how airlines may deal with this informed customer base. This field, assuming data can be procured to study it, is rich with potential for analysis. We feel as though we have only scratched the tip of the proverbial iceberg, leaving many studies and investigations still to be done in the area of airline ticket purchasing habits.